Last time I mentioned that there are many retriever methods; now let me introduce them one by one!
Usage:
The vector store-backed retriever uses a vector store to support retrieval. You store document embeddings in the vector store and then fetch relevant documents via similarity search, maximal marginal relevance (MMR), and other search types.
Code:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
# Load the document
loader = TextLoader("../../state_of_the_union.txt")
documents = loader.load()
# Split the text into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
# Create the embeddings and the vector store
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(texts, embeddings)
# Create the retriever
retriever = db.as_retriever()
# Retrieve with the retriever
docs = retriever.invoke("what did he say about ketanji brown jackson")
Other options:
# Maximal marginal relevance search instead of plain similarity
retriever = db.as_retriever(search_type="mmr")
# Only return documents whose similarity score passes the threshold
retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5})
# Only return the single most relevant document
retriever = db.as_retriever(search_kwargs={"k": 1})
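To build intuition for the search_type="mmr" option above, here is a minimal plain-Python sketch of maximal marginal relevance: greedily pick documents that balance relevance to the query against redundancy with the documents already selected. This is a conceptual illustration only, not FAISS's or LangChain's actual implementation; the lambda_mult trade-off parameter here is named after the kwarg LangChain exposes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: trade off query relevance against redundancy with picks so far."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -float("inf")
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy = similarity to the most similar already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lambda_mult=1.0 this degenerates to plain top-k similarity search; smaller values increasingly favor diversity over raw relevance.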
Usage:
MultiQueryRetriever uses an LLM to generate multiple variants of the query, retrieves documents from these different angles, and finally merges the results, yielding a broader set of relevant documents.
Code:
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
# Load the data
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()
# Split the text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)
# Create the vector store
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)
# Create the retriever, using an LLM to generate multiple queries
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(), llm=llm)
# Retrieve and collect the results
unique_docs = retriever_from_llm.invoke("What are the approaches to Task Decomposition?")
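Conceptually, the merging step works like the sketch below: run each generated query variant through the base retriever and keep the unique union of results in order. This is a hand-rolled illustration, not LangChain's internal code; the tiny keyword corpus stands in for a real vector store.

```python
def merge_unique(queries, retrieve_fn):
    """Retrieve for each query variant and keep the unique union, preserving order."""
    seen, merged = set(), []
    for q in queries:
        for doc in retrieve_fn(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Toy base retriever over a tiny keyword index (stands in for vectordb.as_retriever()).
corpus = {
    "task decomposition": ["doc_a", "doc_b"],
    "breaking down tasks": ["doc_b", "doc_c"],
    "subtask planning": ["doc_c", "doc_d"],
}
# Duplicates across the three query variants are dropped; order of first appearance wins.
docs = merge_unique(corpus.keys(), lambda q: corpus[q])
```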
Usage:
The Contextual Compression Retriever makes retrieval results more precise by compressing document content so that only information relevant to the query is returned. It does this either by compressing individual documents or by filtering out irrelevant ones.
Code:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI
# Initialize the vector store retriever
documents = TextLoader("../../state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()
# Add contextual compression on top of the base retriever
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
# Retrieve and compress the results
compressed_docs = compression_retriever.invoke("What did the president say about Ketanji Jackson Brown")
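As a rough mental model of what the extractor does, the toy function below keeps only the sentences that share a word with the query. The real LLMChainExtractor asks an LLM to extract the relevant passages, but the shape of the operation is the same; compress_document is a hypothetical helper, not part of LangChain.

```python
def compress_document(text: str, query: str) -> str:
    """Toy compressor: keep only sentences sharing at least one word with the query."""
    query_words = set(query.lower().split())
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for s in sentences if query_words & set(s.lower().split())]
    return ". ".join(kept)

text = "Dogs are loyal. Cats are independent. The sky is blue."
# Only the first sentence mentions a query word, so only it survives compression.
compressed = compress_document(text, "loyal dogs")
```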
Usage:
A custom retriever lets you implement retrieval logic tailored to your needs; you typically subclass BaseRetriever and implement the required method.
Code:
from typing import List
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
class ToyRetriever(BaseRetriever):
"""一個簡單的retriever,它返回包含用戶查詢的文檔。"""
documents: List[Document]
k: int
def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
matching_documents = []
for document in self.documents:
if len(matching_documents) > self.k:
return matching_documents
if query.lower() in document.page_content.lower():
matching_documents.append(document)
return matching_documents
# 測試ToyRetriever
documents = [
Document(page_content="Dogs are great companions, known for their loyalty and friendliness."),
Document(page_content="Cats are independent pets that often enjoy their own space."),
]
retriever = ToyRetriever(documents=documents, k=1)
result = retriever.invoke("Dogs")
Usage:
EnsembleRetriever combines the results of multiple retrievers and re-ranks the merged list with the Reciprocal Rank Fusion (RRF) algorithm. This lets you combine the strengths of different algorithms, such as keyword-based BM25 with dense vector search, to improve retrieval quality.
Code:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Initialize the BM25 and FAISS retrievers
bm25_retriever = BM25Retriever.from_texts(["I like apples", "I like oranges", "Apples and oranges are fruits"])
faiss_vectorstore = FAISS.from_texts(["You like apples", "You like oranges"], OpenAIEmbeddings())
# Create the EnsembleRetriever (BM25Retriever is already a retriever, so no as_retriever() call)
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_vectorstore.as_retriever()], weights=[0.5, 0.5])
# Retrieve the results
docs = ensemble_retriever.invoke("apples")
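The Reciprocal Rank Fusion step mentioned above can be sketched in a few lines: a document scores 1/(c + rank) in every result list it appears in, the scores are summed, and documents are re-ranked by total. The constant c = 60 is the value commonly used in the RRF literature; this is an illustration of the algorithm, not LangChain's internal code.

```python
def reciprocal_rank_fusion(ranked_lists, c=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (c + rank(d))."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (c + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears near the top of both lists, so it outranks "a",
# which tops only the first list.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c"]])
```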
Tomorrow I will introduce the other five retriever methods!